1.数据集“Movies Dataset from Pirated Sites”的数据分析

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm, trange

1.1 将数据读入为DataFrame,并展示数据中的数据类型

In [23]:
movie_data = pd.read_csv('/content/data/MyDrive/data/movies_dataset.csv')
print(movie_data.head(5))
print(movie_data.dtypes)
   Unnamed: 0  IMDb-rating appropriate_for      director downloads      id  \
0           0          4.8               R     John Swab       304  372092   
1           1          6.4           TV-PG   Paul Ziller        73  372091   
2           2          5.2               R  Ben Wheatley     1,427  343381   
3           3          8.1             NaN  Venky Atluri     1,549  372090   
4           4          4.6             NaN  Shaji Kailas       657  372089   

              industry       language   posted_date release_date  run_time  \
0  Hollywood / English        English  20 Feb, 2023  Jan 28 2023       105   
1  Hollywood / English        English  20 Feb, 2023  Feb 05 2023        84   
2  Hollywood / English  English,Hindi  20 Apr, 2021  Jun 18 2021  1h 47min   
3            Tollywood          Hindi  20 Feb, 2023  Feb 17 2023       139   
4            Tollywood          Hindi  20 Feb, 2023  Jan 26 2023       122   

                                           storyline  \
0  Doc\r\n facilitates a fragile truce between th...   
1  Caterer\r\n Goldy Berry reunites with detectiv...   
2  As the world searches for a cure to a disastro...   
3  The life of a young man and his struggles agai...   
4  A man named Kalidas gets stranded due to the p...   

                                        title   views                  writer  
0                                Little Dixie   2,794               John Swab  
1  Grilling Season: A Curious Caterer Mystery   1,002  John Christian Plummer  
2                                In the Earth  14,419            Ben Wheatley  
3                                      Vaathi   4,878            Venky Atluri  
4                                       Alone   2,438        Rajesh Jayaraman  
Unnamed: 0           int64
IMDb-rating        float64
appropriate_for     object
director            object
downloads           object
id                   int64
industry            object
language            object
posted_date         object
release_date        object
run_time            object
storyline           object
title               object
views               object
writer              object
dtype: object

由上述数据类型和数据展示可以判断,“id”、“appropriate_for”、“language”、“industry”等属性为标称属性;“IMDb-rating”、“downloads”等属性为数值属性。

1.2 处理部分属性数据

可删去“Unnamed: 0”这个与数据分析无关的属性列。

In [24]:
movie_data = movie_data.drop(['Unnamed: 0'],axis = 1)

将“run_time”、“downloads”、“views”转为float型数据。

In [25]:
#转换downloads和views
for col in 'downloads', 'views':
  movie_data[col] = movie_data[col].str.replace(',','')
  movie_data[col] = movie_data[col].astype('float')

#转换run_time为以min为单位的float型数据
new = []
for item in movie_data['run_time']:
  if "h" in str(item) and ("min" in str(item) or "m" in str(item)):
    item = item.strip('min')
    hour, min = item.split('h ')
    time = float(hour)*60 + float(min)
    new.append(time)
  elif "h" in str(item):
    hour = item.strip('h')
    time = float(hour)*60
    new.append(time)
  elif "min" in str(item) or "m" in str(item):
    min = item.strip('min')
    time = float(min)
    new.append(time)
  else:
    new.append(float(item))
movie_data['run_time'] = new

将“posted_date”和“release_date”转为datetime数据。

In [26]:
movie_data['posted_date'] = pd.to_datetime(movie_data['posted_date'])
movie_data['release_date'] = pd.to_datetime(movie_data['release_date'])

1.2 展示数据摘要

(1)首先,检查统计的电影是否有重复,进行去重处理,将数值属性数据的最大值保留。

In [27]:
print(movie_data['id'].value_counts())
372090    402
371744    402
371877    402
372092    202
371991    202
         ... 
303381      1
303380      1
303379      1
303377      1
30459       1
Name: id, Length: 17086, dtype: int64
In [28]:
non_unique_id_mov_data = movie_data.groupby('id').agg(id_count=('id', 'count')).query('id_count > 1').sort_values(by='id_count', ascending=False).index.to_list()
for id in non_unique_id_mov_data:
  non_unique_id_df = movie_data.query('id == @id').groupby('id').agg(max_downloads=('downloads', 'max'), max_views=('views', 'max'))
  max_downloads = non_unique_id_df['max_downloads'].iloc[0]
  max_views = non_unique_id_df['max_views'].iloc[0]

  movie_data.loc[movie_data['id'] == id, 'downloads'] = max_downloads
  movie_data.loc[movie_data['id'] == id, 'views'] = max_views
movie_data = movie_data.drop_duplicates()

(2)接着,展示标称属性的值和频数统计,此处将给出每种可能值的频数统计,并且将展示频数的TOP20的柱状图。

In [29]:
nominal = ['appropriate_for','director','industry','language','writer','title','run_time','posted_date','release_date']

对于“appropriate_for”中的Unrated和Not Rated进一步处理:

In [30]:
movie_data['appropriate_for'] = movie_data['appropriate_for'].replace('Unrated', 'Not Rated')

接着画出柱状图:

In [31]:
def bar_plot(data, feature_list):
  for col in feature_list:
    print(data[col].value_counts())
    v = list(data[col].value_counts().values)
    id = list(data[col].value_counts().index)

    bar = plt.barh(id[:20],v[:20])
    plt.yticks(rotation = 45, size = 7)
    plt.xticks(rotation = 45, size = 7)
    plt.bar_label(bar, v[:20])
    plt.show()
In [32]:
bar_plot(movie_data,nominal[:-2])
R                 3752
Not Rated         2264
PG-13             1763
PG                 884
TV-MA              406
TV-14              293
G                  140
TV-PG              115
TV-G                99
TV-Y7               44
TV-Y                25
Approved             9
NC-17                4
TV-Y7-FV             3
Passed               3
MA-17                1
TV-13                1
Drama                1
Drama, Romance       1
18+                  1
Name: appropriate_for, dtype: int64
David Dhawan             32
Ram Gopal Varma          27
Vikram Bhatt             25
Steven Spielberg         24
Priyadarshan             22
                         ..
Mike Gan                  1
Santhosh P. Jayakumar     1
Johnny Mitchell           1
Vijay Yelakanti           1
Biren Nag                 1
Name: director, Length: 9672, dtype: int64
Hollywood / English    12197
Bollywood / Indian      2420
Anime / Kids            1028
Tollywood                769
Punjabi                  331
Stage shows              124
Pakistani                 92
Wrestling                 78
Dub / Dual Audio          45
3D Movies                  1
Name: industry, dtype: int64
English                                    10052
Hindi                                       1938
English,Spanish                              388
Punjabi                                      309
Telugu                                       297
                                           ...  
English,Spanish,Armenian                       1
Bosnian,English,Croatian,German,Serbian        1
English,Korean,Spanish                         1
Norwegian,Swedish                              1
Spanish,German,English                         1
Name: language, Length: 1168, dtype: int64
Naresh Kathuria                           11
Dheeraj Rattan                            11
Andrew Jones                              11
Jagdeep Singh                             10
Tyler Perry                               10
                                          ..
Jack Ketchum                               1
Kraig Wenman, Peter Sullivan               1
Robert Bruzio                              1
Lynn Shelton, Michael Patrick O'Brien      1
Khwaja Ahmad Abbas, Khwaja Ahmad Abbas     1
Name: writer, Length: 13603, dtype: int64
Pinocchio                             5
Alone                                 5
True Justice                          5
Sacrifice                             5
Blood Money                           4
                                     ..
The Professor                         1
A Violent Separation                  1
We Have Always Lived in the Castle    1
Valentine                             1
Madhumati                             1
Name: title, Length: 16572, dtype: int64
90.0     851
95.0     470
92.0     447
93.0     427
85.0     418
        ... 
51.0       1
26.0       1
28.0       1
32.0       1
271.0      1
Name: run_time, Length: 200, dtype: int64

统计“language”属性中具体语言种类与频数:

In [33]:
mov_language = movie_data['language'].astype('str')

for i in range(len(mov_language)):
  mov_language.iloc[i] = mov_language.iloc[i].split(',')
  for j in range(len(mov_language.iloc[i])):
    mov_language.iloc[i][j] = mov_language.iloc[i][j].lstrip(' ')

mov_language_dict = {}

for i in range(len(mov_language)):
  for j in range(len(mov_language.iloc[i])):
    if mov_language.iloc[i][j] not in mov_language_dict:
        mov_language_dict['{}'.format(mov_language.iloc[i][j])] = 0
    mov_language_dict['{}'.format(mov_language.iloc[i][j])] += 1
mov_language_dict = dict(sorted(mov_language_dict.items(),key = lambda x:x[1],reverse = True))
print(mov_language_dict)
{'English': 13055, 'Hindi': 2638, 'Spanish': 775, 'French': 643, 'nan': 534, 'Telugu': 412, 'Punjabi': 402, 'German': 372, 'Russian': 323, 'Tamil': 319, 'Italian': 251, 'Japanese': 212, 'Mandarin': 192, 'Arabic': 162, 'Urdu': 160, 'Latin': 109, 'Cantonese': 97, 'Chinese': 97, 'Marathi': 87, 'Korean': 80, 'Portuguese': 80, 'Kannada': 73, 'Malayalam': 73, 'Ukrainian': 70, 'Hebrew': 60, 'Thai': 54, 'Dutch': 46, 'Greek': 44, 'Polish': 43, 'Norwegian': 42, 'Swedish': 39, 'Danish': 36, 'Turkish': 36, 'Persian': 35, 'American Sign Language': 35, 'Bengali': 35, 'Romanian': 31, 'Vietnamese': 28, 'English Hindi': 28, 'Hungarian': 27, 'Serbian': 26, 'Indonesian': 26, 'Czech': 24, 'Panjabi': 22, 'Afrikaans': 20, 'Gujarati': 15, 'Icelandic': 15, 'Bulgarian': 13, 'Aboriginal': 13, 'Finnish': 12, 'Gaelic': 12, 'Irish Gaelic': 11, 'Welsh': 11, 'Xhosa': 11, 'Sanskrit': 11, 'Yiddish': 11, 'Pashtu': 11, 'Swahili': 10, 'Rajasthani': 10, 'Filipino': 9, 'None': 9, 'Bhojpuri': 9, 'North American Indian': 8, 'Hawaiian': 8, 'Somali': 8, 'Tagalog': 8, 'Armenian': 8, 'Irish': 8, 'Maori': 7, 'Haryanvi': 7, 'Malay': 7, 'Serbo-Croatian': 7, 'Central Khmer': 7, 'Swiss German': 6, 'Zulu': 6, 'Croatian': 6, 'Albanian': 6, 'Inuktitut': 6, 'Sign Languages': 6, 'Bosnian': 6, 'Esperanto': 6, 'Min Nan': 5, 'Nepali': 5, 'Estonian': 5, 'Tibetan': 5, 'Yoruba': 4, 'Norse': 4, 'Old': 4, 'Ancient (to 1453)': 4, 'Dari': 4, 'Mongolian': 4, 'Lithuanian': 4, 'Scottish Gaelic': 4, 'Old English': 4, 'Flemish': 4, 'Egyptian (Ancient)': 4, 'Scots': 4, 'Sindhi': 4, 'Kazakh': 3, 'Kashmiri': 3, 'Pushto': 3, 'Burmese': 3, 'Kurdish': 3, 'Saami': 3, 'Lingala': 3, 'Catalan': 3, 'Klingon': 3, 'Shanghainese': 3, 'Sindarin': 3, 'Akan': 2, 'Maltese': 2, 'Samoan': 2, 'Cree': 2, 'Tupi': 2, 'Macedonian': 2, 'Georgian': 2, 'Amharic': 2, 'Micmac': 2, 'Tswana': 2, 'Basque': 2, 'Quechua': 2, 'Navajo': 2, 'Ibo': 2, 'Nyanja': 2, 'Sotho': 2, 'Sioux': 2, 'Papiamento': 2, 'Slovenian': 2, 'Oriya': 2, 'Lao': 2, 'Southern Sotho': 2, 'Aramaic': 2, 'Brazilian Sign Language': 2, 'Tulu': 2, 'Maya': 2, 'Khmer': 2, 'Romany': 2, 'Awadhi': 2, 'Bambara': 1, 'Sumerian': 1, 'Ojibwa': 1, 'Chechen': 1, 'Shoshoni': 1, 'Nahuatl': 1, 'Bable': 1, 'Tonga (Tonga Islands)': 1, 'Spanish Sign Language': 1, 'Wolof': 1, 'Dinka': 1, 'Chaozhou': 1, 'Russian Sign Language': 1, 'Parsee': 1, 'Visayan': 1, 'Rhaetian': 1, 'Turkmen': 1, 'Middle English': 1, 'p': 1, 'British Sign Language': 1, 'Igbo': 1, 'Hokkien': 1, 'Nama': 1, 'Polynesian': 1, 'Cheyenne': 1, 'Uzbek': 1, 'Mohawk': 1, 'Kikuyu': 1, 'Azerbaijani': 1, 'Assamese': 1, 'Abkhazian': 1, 'Pawnee': 1, 'Australian Sign Language': 1, 'Konkani': 1, 'Syriac': 1, 'Greenlandic': 1, 'Indian Sign Language': 1, 'English.Hindi': 1, 'Quenya': 1, 'Sinhala': 1, 'Tok Pisin': 1, 'Sicilian': 1, 'Slovak': 1, 'Acholi': 1, 'Apache languages': 1, 'Low German': 1, 'Kru': 1, 'Mende': 1, 'Algonquin': 1}
In [34]:
x = list(mov_language_dict.keys())[:20]
y = list(mov_language_dict.values())[:20]
bar = plt.barh(x,y)
plt.yticks(rotation = 45, size = 7)
plt.xticks(rotation = 45, size = 7)
plt.bar_label(bar, y)
plt.show()

对于上映日期和平台发布日期,通过直方图展示其分布。

In [35]:
print(movie_data['release_date'].value_counts())
movie_data['release_date'].hist()
plt.title('release_date distribution')
plt.show()
print(movie_data['posted_date'].value_counts())
movie_data['posted_date'].hist()
plt.title('posted_date distribution')
plt.show()
1970-01-01    958
2022-10-07     29
2020-02-14     28
2019-12-06     28
2015-10-16     27
             ... 
1999-07-16      1
1994-07-15      1
1997-09-26      1
1993-09-17      1
1958-03-28      1
Name: release_date, Length: 4886, dtype: int64
1970-01-01    38
2014-01-03    24
2014-09-11    21
2013-02-25    20
2014-02-01    16
              ..
2009-05-26     1
2009-05-25     1
2009-05-18     1
2009-05-10     1
2010-03-27     1
Name: posted_date, Length: 4123, dtype: int64

(3)下一步,展示数值属性的5数概括及缺失值个数,并通过直方图展示数据分布及离群点。

In [36]:
numeric = ['views','downloads','IMDb-rating']
#利用describe函数给出5数概括
print(movie_data[numeric].describe().loc[['max', '75%', '50%', '25%', 'min']])
         views  downloads  IMDb-rating
max  1638533.0   391272.0          9.9
75%    43700.0    12224.0          6.5
50%    19207.0     3757.0          5.7
25%     9899.0     1287.0          4.7
min      667.0        0.0          1.1
In [37]:
# 绘制数值属性的直方图
def hist_plot(data, feature_list):
  for col in feature_list:
    data[col].hist(bins=12,histtype="bar",alpha=0.5)
    plt.title(f'Distribution of {col}')
    plt.show()
In [38]:
hist_plot(movie_data,numeric)

1.3 处理缺失值

In [39]:
#检查缺失值个数,展现缺失值比例
data_null = movie_data.isnull().sum(axis=0)
print(data_null.T)
rate = data_null/len(movie_data)
print(rate)
IMDb-rating         477
appropriate_for    7277
director           1566
downloads             1
id                    0
industry              1
language            534
posted_date           1
release_date          1
run_time           1000
storyline          1322
title                 1
views                 1
writer             1820
dtype: int64
IMDb-rating        0.027918
appropriate_for    0.425904
director           0.091654
downloads          0.000059
id                 0.000000
industry           0.000059
language           0.031254
posted_date        0.000059
release_date       0.000059
run_time           0.058527
storyline          0.077373
title              0.000059
views              0.000059
writer             0.106520
dtype: float64

由上述结果可知,appropriate_for属性几乎有一半的数据缺失,缺失率高达42.6%;storyline、run_time、director、writer属性的缺失值较多,分别为7.7%、5.9%、9.2%、10.7%。

用不同方法处理缺失值:

(1)将缺失部分剔除:按行删去含有任意空值的数据

In [40]:
drop_data = movie_data.dropna(axis=0)
print(f"未删除含有空值行前的数据数:{len(movie_data)}")
print(f"删除含有空值行后的数据数:{len(drop_data)}")
print(f"数据保留率:{len(drop_data)/len(movie_data)*100}%")
未删除含有空值行前的数据数:17086
删除含有空值行后的数据数:8650
数据保留率:50.62624370829919%

根据上述结果,可以发现,若删去含空值行,数据量将变为仅原先的一半,由先前的分析可以知道删去的数据大多是因为缺少“appropriate_for”值,显然这样的处理是不合理的。

可以根据删除前后的部分属性的柱状图来对比新旧数据差异:

In [49]:
def process_show(old_data, new_data, feature):
  feature_count = pd.DataFrame(old_data[feature].value_counts()).rename(columns={feature: 'feature_count'}).sort_values(by='feature_count', ascending=True)
  feature_count_cleaned = feature_count
  feature_count_cleaned['feature_count_cleaned'] = [0] * len(feature_count)

  for level in list(feature_count.index):
    if level in list(new_data[feature].value_counts().index):
      feature_count_cleaned.loc[[level], ['feature_count_cleaned']] = new_data[feature].value_counts().loc[[level]].values[0]
  plt.figure(figsize=(20, 20))
  # plt.barh(feature_count_cleaned.index, width=feature_count_cleaned['feature_count'], label='feature_count')
  # plt.barh(feature_count_cleaned.index, width=feature_count_cleaned['feature_count_cleaned'], label='feature_count_cleaned')
  bar1 = plt.barh([d+0.42 for d in list(range(len(feature_count_cleaned)))], tick_label=feature_count_cleaned.index, width=feature_count_cleaned['feature_count'], label='feature_count', height=0.4)
  bar2 = plt.barh(list(range(len(feature_count_cleaned))), tick_label=feature_count_cleaned.index, width=feature_count_cleaned['feature_count_cleaned'], label='feature_count_cleaned', height=0.4)
  plt.yticks(rotation = 45, fontsize=20)
  plt.xticks(rotation = 45, fontsize=20)
  plt.bar_label(bar1,feature_count_cleaned['feature_count'],fontsize=16)
  plt.bar_label(bar2,feature_count_cleaned['feature_count_cleaned'], fontsize=16)
  plt.title(f"Comparison of {feature}", fontsize=24)
  plt.legend(fontsize=24, loc='lower right')
  plt.show()
In [50]:
process_show(movie_data,drop_data,'appropriate_for')

(2)用最高频率值来填补缺失值

In [51]:
def fill_highest_fre(data, feature_list):
  dict = {}
  for col in feature_list:
    top = data[col].value_counts().index
    dict[col] = list(top)[0]
  new_data = data.fillna(dict)
  return new_data
In [59]:
fill_data = fill_highest_fre(movie_data,movie_data.columns)

对于标称属性,用最高频率值填充缺失值,并不会影响其分布,只是最高频率值的频率将增加,仍以appropriate_for属性为例:

In [66]:
process_show(movie_data,fill_data,'appropriate_for')

展示经过删除缺失值和用最高频率填充缺失值这两种方法的数值属性的盒图对比:

In [64]:
def box_compare(data1,data2, feature_list):
  for col in feature_list:
    l1 = list(data1[col])
    l2 = list(data2[col])
    labels = [f'{col} of drop_data',f'{col} of fill_data']
    plt.grid(True)
    plt.boxplot([l1,l2],
        medianprops={'color': 'red', 'linewidth': '1.5'},
        meanline=True,
        showmeans=True,
        meanprops={'color': 'blue', 'ls': '--', 'linewidth': '1.5'},
        flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 7},
        labels=labels)
    plt.show()
In [65]:
box_compare(drop_data,fill_data,numeric)

(3)通过属性的相关关系来填补缺失值

针对缺失率最高的appropriate_for属性,我们尝试使用计算其与其他属性的相关系数来填补缺失值。

首先将appropriate_for的值进行编码,用spearman方法计算相关系数,并画出热力图。

In [53]:
code_data = pd.get_dummies(movie_data, columns=['appropriate_for'], dummy_na=True, drop_first=True)
plt.figure(figsize=(40, 40))
sns.heatmap(code_data.corr(method='spearman'), cmap='YlGnBu', annot=True)
plt.title('Correlation Analysis')
<ipython-input-53-342ad9129335>:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  sns.heatmap(code_data.corr(method='spearman'), cmap='YlGnBu', annot=True)
Out[53]:
Text(0.5, 1.0, 'Correlation Analysis')

我们可以从热图中看到,appropriate_for与其他属性是弱相关的(不大于0.3)。其中,我们可以发现IMDb-rating是所有属性中最相关的属性,故利用IMDb-ranking来完成这个补全。但,根据相关系数值我们可以猜测使用属性之间的相关性来替换列appropriate_for中的缺失值并不是好的处理。

In [54]:
def feature_corr_fill(data,fill_feature,corr_feature):
  mean_df = data.groupby(fill_feature).agg(avg=(corr_feature, 'mean'))
  corr_data = data.copy(deep=True)
  for i in trange(len(corr_data)):
    if corr_data[fill_feature].iloc[i] is np.nan:
      rate = corr_data[corr_feature].iloc[i]
      dists = []
      for j in range(len(mean_df)):
        dists.append(abs(mean_df.iloc[j]['avg']-rate))
      array = np.array(dists)
      sorted_index = np.argsort(array)
      idx = sorted_index[0]
      corr_data[fill_feature].iloc[i] = mean_df.index[idx]
  return corr_data
In [55]:
corr_data = feature_corr_fill(movie_data,'appropriate_for','IMDb-rating')
  0%|          | 0/17086 [00:00<?, ?it/s]<ipython-input-54-e58f7893bd26>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  corr_data[fill_feature].iloc[i] = mean_df.index[idx]
100%|██████████| 17086/17086 [00:08<00:00, 2033.19it/s]

展示填充前后柱状图对比:

In [56]:
process_show(movie_data,corr_data,'appropriate_for')

(4)通过数据对象之间的相似性来填补缺失值

针对缺失率最高的appropriate_for属性,我们尝试使用计算数据对象间的相似性来填补缺失值。首先对数值属性进行归一化(regularit函数);接着,计算各个例子的数值属性的均方差,找到最相似的例子的appropriate_for的取值填充。

In [57]:
def sample_corr_fill(data, fill_feature, feature_list):

  def regularit(data,feature_list):
    new_df = pd.DataFrame(index=data.index)
    for col in feature_list:
        d = data[col]
        MAX = d.max()
        MIN = d.min()
        new_df[col] = ((d - MIN) / (d - MAX))
    return new_df

  def eucliDist(X,Y):
    return np.sqrt(sum(np.power((X - Y), 2)))
  def manhDist(X,Y):
    return np.sum(abs(X-Y))

  reg_data = regularit(data,feature_list)
  print("Show regularized data:")
  print(reg_data.head(20))
  no_nan_data = pd.concat([reg_data,data[fill_feature]],axis=1).dropna(axis=0,inplace=False)

  nan_data = pd.concat([reg_data,data[fill_feature]], axis=1)
  nan_data = nan_data[nan_data[fill_feature].isnull() == True]
  #print(nan_data.head(20))
  sample_corr_data = data.copy(deep=True)
  nan_index = nan_data.index
  no_nan_index = no_nan_data.index
  for id_nan in tqdm(nan_index,total=len(nan_data),desc='Calculating similarity...'):
    dists = {}
    for id_sample in no_nan_index:
      dist = manhDist(np.array(reg_data.loc[id_nan]),np.array(reg_data.loc[id_sample]))
      dists[id_sample] = dist
    order = sorted(dists.items(),key=lambda x:x[1],reverse=False)
    idx = order[0][0]
    sample_corr_data[fill_feature].loc[id_nan] = sample_corr_data[fill_feature].loc[idx]
  return sample_corr_data
In [62]:
sample_corr_data = sample_corr_fill(movie_data.iloc[0:3000],'appropriate_for',numeric)
Show regularized data:
       views  downloads  IMDb-rating
0  -0.005675  -0.001581    -0.725490
1  -0.000501  -0.000082    -1.514286
2  -0.020988  -0.005414    -0.872340
3  -0.010916  -0.009198    -3.888889
4  -0.002654  -0.002375    -0.660377
5  -0.003409  -0.002726    -0.955556
6  -0.017198  -0.021116          NaN
7  -0.026953  -0.006818    -1.588235
8  -0.009423  -0.001593    -1.933333
9  -0.013702  -0.007549    -0.543860
10 -0.005941  -0.002710    -1.378378
11 -0.056824  -0.054143    -8.777778
12 -0.007228  -0.008695          NaN
13 -0.061454  -0.061598    -1.666667
14 -0.008122  -0.001648    -2.142857
15 -0.232568  -0.193860    -1.588235
16 -0.018732  -0.010820          NaN
17 -0.058540  -0.051867    -1.514286
18 -0.000000  -0.000466          NaN
19 -0.017051  -0.005533    -0.692308
Calculating similarity...:   0%|          | 0/1643 [00:00<?, ?it/s]<ipython-input-57-c040e28d90ba>:35: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample_corr_data[fill_feature].loc[id_nan] = sample_corr_data[fill_feature].loc[idx]
Calculating similarity...: 100%|██████████| 1643/1643 [05:12<00:00,  5.25it/s]

鉴于数据维度较大,运行时间过长,将只展示填充数据前3000行的填充效果:

In [63]:
process_show(movie_data.iloc[0:3000],sample_corr_data,'appropriate_for')

2.数据集“GitHub Dataset”的数据分析

2.1 将数据读入为DataFrame,并展示数据中的数据类型

In [67]:
github_data = pd.read_csv('/content/data/MyDrive/data/github_dataset.csv')
print(github_data.head(5))
print(github_data.dtypes)
                repositories  stars_count  forks_count  issues_count  \
0        octocat/Hello-World            0            0           612   
1  EddieHubCommunity/support          271          150           536   
2             ethereum/aleth            0            0           313   
3      localstack/localstack            0            0           290   
4        education/classroom            0          589           202   

   pull_requests  contributors language  
0            316             2      NaN  
1              6            71      NaN  
2             27           154      C++  
3             30           434   Python  
4             22            67     Ruby  
repositories     object
stars_count       int64
forks_count       int64
issues_count      int64
pull_requests     int64
contributors      int64
language         object
dtype: object

由上述数据类型和数据展示可以判断,“repositories”、“language”这两个属性为标称属性;“stars_count”、“forks_count”、“issues_count”、“pull_requests”、“contributors”属性为数值属性。

2.2 展示数据摘要

(1)首先,检查数据是否有重复,进行去重处理。

In [68]:
print(github_data['repositories'].value_counts())
kameshsampath/ansible-role-rosa-demos         2
aloisdeniel/bluff                             2
antoniaandreou/github-slideshow               2
jgthms/bulma-start                            2
artkirienko/hlds-docker-dproto                2
                                             ..
WhiteHouse/CIOmanagement                      1
0xCaso/defillama-telegram-bot                 1
ethereum/blake2b-py                           1
openfoodfacts/folksonomy_mobile_experiment    1
gamemann/All_PropHealth                       1
Name: repositories, Length: 972, dtype: int64
In [69]:
github_data = github_data.drop_duplicates()
print(github_data['repositories'].value_counts())
octocat/Hello-World                         1
ethereum/cable                              1
feross/blob-to-buffer                       1
studiomohawk/js-testing-boilerplates        1
studiomohawk/jekyll-theme-the_minimum       1
                                           ..
AnkushMalaker/speech-emotion-recognition    1
mikenikles/ghost-v3-google-cloud-storage    1
yourcelf/afgexplorer                        1
trailofbits/circuitous-benchmarks           1
gamemann/All_PropHealth                     1
Name: repositories, Length: 972, dtype: int64

(2)接着,展示标称属性的值和频数统计,此处将给出每种可能值的频数统计,并且将展示频数的TOP20的柱状图。

以下将处理“repositories”属性,将对该属性的值进行分词处理,例如“octocat/Hello-World”,将以“/”为分词符提取作者名“octcoat”和项目名“Hello-World”,并将作者名与项目名存入github_data数据中,分别为“author”和“project”,两者均为标称属性。

In [70]:
authors = []
projects = []
for item in github_data['repositories']:
    au , pro = item.split('/')
    authors.append(au)
    projects.append(pro)
github_data['author'] = authors
github_data['project'] = projects
print(github_data.head(5))
                repositories  stars_count  forks_count  issues_count  \
0        octocat/Hello-World            0            0           612   
1  EddieHubCommunity/support          271          150           536   
2             ethereum/aleth            0            0           313   
3      localstack/localstack            0            0           290   
4        education/classroom            0          589           202   

   pull_requests  contributors language             author      project  
0            316             2      NaN            octocat  Hello-World  
1              6            71      NaN  EddieHubCommunity      support  
2             27           154      C++           ethereum        aleth  
3             30           434   Python         localstack   localstack  
4             22            67     Ruby          education    classroom  
In [71]:
nominal = ['author','project','language']
bar_plot(github_data,nominal)
EddieHubCommunity    21
openfoodfacts        20
ethereum             18
bradtraversy         16
google               16
                     ..
Orbiter               1
motasimmakki          1
dylanbeattie          1
AnandBaburajan        1
gamemann              1
Name: author, Length: 296, dtype: int64
github-slideshow    11
dotfiles             5
.github              3
blog                 3
chat                 2
                    ..
booklee              1
DiscrimFarmer        1
bext                 1
deposit-verifier     1
All_PropHealth       1
Name: project, Length: 946, dtype: int64
JavaScript          237
Python              151
HTML                 65
Java                 43
TypeScript           34
CSS                  30
Dart                 30
C++                  28
Jupyter Notebook     28
Ruby                 24
Shell                24
C                    23
PHP                  16
Go                   15
Rust                 10
Swift                10
Objective-C           8
Kotlin                7
C#                    7
SCSS                  4
Makefile              4
Jinja                 3
Dockerfile            3
Solidity              3
CoffeeScript          3
AutoHotkey            3
Hack                  2
CodeQL                2
PowerShell            2
Vue                   2
Assembly              2
Vim Script            2
Perl                  2
Elixir                2
Gherkin               1
Cuda                  1
QMake                 1
Pawn                  1
CMake                 1
Oz                    1
QML                   1
ActionScript          1
Roff                  1
HCL                   1
R                     1
PureBasic             1
Smarty                1
Less                  1
Svelte                1
Haskell               1
SourcePawn            1
Name: language, dtype: int64

(3)下一步,展示数值属性的5数概括及缺失值个数,并通过直方图和盒图展示数据分布及离群点。

In [72]:
numeric = ['stars_count','forks_count','issues_count','pull_requests','contributors']
print(github_data[numeric].describe().loc[['max', '75%', '50%', '25%', 'min']])
     stars_count  forks_count  issues_count  pull_requests  contributors
max        995.0        973.0         612.0          567.0         658.0
75%         65.0         37.0           6.0            2.0           4.0
50%         12.0          6.0           2.0            0.0           2.0
25%          1.0          1.0           1.0            0.0           0.0
min          0.0          0.0           1.0            0.0           0.0
In [73]:
hist_plot(github_data,numeric)
In [74]:
def box_plot(data, feature_list):
  l = []
  for col in feature_list:
    ll = list(data[col])
    l.append(ll)

  plt.grid(True)
  plt.boxplot(l,
        medianprops={'color': 'red', 'linewidth': '1.5'},
        meanline=True,
        showmeans=True,
        meanprops={'color': 'blue', 'ls': '--', 'linewidth': '1.5'},
        flierprops={"marker": "o", "markerfacecolor": "red", "markersize": 7},
        labels=feature_list)
  plt.show()
In [75]:
box_plot(github_data,numeric)

2.3 处理缺失值

In [76]:
#检查缺失值个数,展现缺失值比例
data_null = github_data.isnull().sum(axis=0)
print(data_null.T)
rate = data_null/len(github_data)
print(rate)
repositories       0
stars_count        0
forks_count        0
issues_count       0
pull_requests      0
contributors       0
language         126
author             0
project            0
dtype: int64
repositories     0.00000
stars_count      0.00000
forks_count      0.00000
issues_count     0.00000
pull_requests    0.00000
contributors     0.00000
language         0.12963
author           0.00000
project          0.00000
dtype: float64

由以上结果可知,仅“language”属性含缺失值,接下来用4种方法填充“language”中的缺失值:

(1)将缺失部分剔除:按行删去含有任意空值的数据

In [77]:
drop_data = github_data.dropna(axis=0)
print(f"未删除含有空值行前的数据数:{len(github_data)}")
print(f"删除含有空值行后的数据数:{len(drop_data)}")
print(f"数据保留率:{len(drop_data)/len(github_data)*100}%")
未删除含有空值行前的数据数:972
删除含有空值行后的数据数:846
数据保留率:87.03703703703704%

由上面的结果可知,本数据的数据量较小,且数据中空值较少,因此删去空值后数据保留率可达到86%,是效果较为理想。

(2)用最高频率值来填补缺失值

In [78]:
fill_data = fill_highest_fre(github_data,['language'])
process_show(github_data, fill_data, 'language')

(3)通过属性的相关关系来填补缺失值

由相关系数热力图可以看出,“language”属性与其他的数值属性的相关度不高,可以猜测,利用属性的相关关系来填充数据效果将不一定理想。此处,我尝试选择利用“forks_count”属性来填充“language”属性的缺失值。

In [79]:
feature_corr_data = feature_corr_fill(github_data,'language','forks_count')
process_show(github_data,feature_corr_data,'language')
  0%|          | 0/972 [00:00<?, ?it/s]<ipython-input-54-e58f7893bd26>:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  corr_data[fill_feature].iloc[i] = mean_df.index[idx]
100%|██████████| 972/972 [00:00<00:00, 3661.19it/s]

(4)通过数据对象之间的相似性来填补缺失值

In [80]:
sample_corr_data = sample_corr_fill(github_data,'language',numeric)
Show regularized data:
    stars_count  forks_count  issues_count  pull_requests  contributors
0     -0.000000    -0.000000           inf      -1.258964     -0.003049
1     -0.374309    -0.182260     -7.039474      -0.010695     -0.120954
2     -0.000000    -0.000000     -1.043478      -0.050000     -0.305556
3     -0.000000    -0.000000     -0.897516      -0.055866     -1.937500
4     -0.000000    -1.533854     -0.490244      -0.040367     -0.113367
5     -0.000000    -0.000000     -0.388636      -0.000000     -0.004580
6     -0.000000    -0.000000     -0.363839      -0.406948     -0.211786
7     -0.000000    -0.000000     -0.325380      -0.389706     -1.366906
9    -24.512821    -0.422515     -0.300000      -0.029038     -0.394068
10    -0.004036    -0.000000     -0.234343      -0.000000     -0.000000
11    -0.000000    -0.000000     -0.229376      -0.017953     -0.329293
12    -0.305774    -0.255484     -0.207510      -0.032787     -0.000000
13    -0.000000    -0.000000     -0.159393      -0.010695     -0.345603
14    -0.000000    -0.001029     -0.150659      -0.005319     -0.000000
15    -0.005051    -0.002060     -0.150659      -0.053903     -0.388186
16    -0.000000    -1.390663     -0.137803      -0.010695     -0.010753
17    -0.000000    -0.569355     -0.117002      -0.029038     -0.039494
18    -0.000000   -56.235294     -0.112933      -0.040367           inf
19    -0.000000    -0.210199     -0.108893      -0.010695     -0.124786
20    -0.000000    -0.666096     -0.104882      -0.010695     -0.128645
Calculating similarity...:   0%|          | 0/126 [00:00<?, ?it/s]<ipython-input-57-c040e28d90ba>:35: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sample_corr_data[fill_feature].loc[id_nan] = sample_corr_data[fill_feature].loc[idx]
Calculating similarity...: 100%|██████████| 126/126 [00:16<00:00,  7.66it/s]
In [81]:
process_show(github_data,sample_corr_data,'language')